Naive Bayes and Map-Reduce
Abstract
We start with a very simple learning algorithm: multinomial Naive Bayes. Our implementation is in Table 1. Each training example is a labeled document d = (i, y, (w_1, ..., w_{n_i})) with an identifier i, a label y from a small set Y = {y_1, ..., y_K}, and a “bag of words”. The bag of words consists of the w_j’s, encoded here as a list of strings, so that w_j is the word/token at position j of document i. For example, the bag of words for the paragraph below would be the strings “when”, “scaling”, “to”, ..., “large”, “datasets”, and “:”. The test documents have the same form, but their labels are unknown. Our goal is to read a training set, build a classifier, and then predict a label for each test example.

When scaling to large datasets, the first thing to remember is that main memory (RAM) is, relatively speaking, limited and expensive, while disk space is less limited and cheap. So we want to carefully control how much memory we use. In particular, if you care about processing large datasets:
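The setup described in the abstract (labeled documents as (i, y, tokens) triples, a training pass, then prediction) can be illustrated with a minimal Python sketch. This is not the implementation from Table 1, which is not reproduced here. The function names `train` and `predict`, the add-one smoothing parameter `alpha`, and the toy documents are all assumptions made for illustration. The training pass reads examples one at a time from any iterable, so only the count tables, not the documents themselves, need to fit in memory.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label words in one streaming pass.

    `examples` is any iterable of (doc_id, label, tokens) triples;
    it is consumed one document at a time and never stored.
    """
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # word_counts[label][word] = count
    vocab = set()                       # distinct words seen in training
    for _doc_id, label, tokens in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens, alpha=1.0):
    """Return the label with the highest smoothed log-probability.

    `alpha` is an assumed add-alpha (Laplace) smoothing constant.
    """
    label_counts, word_counts, vocab = model
    n_docs = sum(label_counts.values())
    vocab_size = len(vocab)
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for w in tokens:
            count = word_counts[label][w]   # 0 for unseen words
            score += math.log((count + alpha) / (total_words + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data in the (i, y, tokens) form used in the abstract.
toy_training_set = [
    (1, "sports", ["ball", "game", "win"]),
    (2, "politics", ["vote", "election", "win"]),
]
toy_model = train(toy_training_set)
toy_prediction = predict(toy_model, ["ball", "game"])
```

Because `train` accepts any iterable, the same function could be fed a generator that reads documents from disk line by line, which keeps memory use proportional to the size of the count tables rather than the size of the corpus.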
Similar resources
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...
Diagnosis of Pulmonary Tuberculosis Using Artificial Intelligence (Naive Bayes Algorithm)
Background and Aim: Despite the implementation of effective preventive and therapeutic programs, no significant success has been achieved in the reduction of tuberculosis. One of the reasons is the delay in diagnosis. Therefore, the creation of a diagnostic aid system can help to diagnose early Tuberculosis. The purpose of this research was to evaluate the role of the Naive Bayes algorithm as a...
The naive Bayes text classification algorithm based on rough set in the cloud platform
This paper improves the naive Bayesian classification algorithm: combining it with rough set theory yields a naive Bayesian classifier based on rough sets. We implement this algorithm on a cloud platform using the map-reduce programming model and get an excellent result. A recall rate of 76.4 was achieved when classifying Tibetan Web pages.
In silico prediction of anticancer peptides by TRAINER tool
Cancer is one of the causes of death in the world. Several treatment methods exist against cancer cells, such as radiotherapy and chemotherapy. Since traditional methods have side effects on normal cells and are expensive, identifying and developing a new method for cancer therapy is very important. Antimicrobial peptides, present in a wide variety of organisms, such as plants, amphibians and ...
Map-Reduce for Machine Learning on Multicore
We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many differe...
On Why Discretization Works for Naive-Bayes Classifiers
We investigate why discretization is effective in naive-Bayes learning. We prove a theorem that identifies particular conditions under which discretization will result in naive-Bayes classifiers delivering the same probability estimates as would be obtained if the correct probability density functions were employed. We discuss the factors that might affect naive-Bayes classification error under ...